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Abstract In this paper we propose a variant of the random coordinate descent 
method for solving linearly constrained convex optimization problems with com- 
posite objective functions. If the smooth part of the objective function has Lip- 
schitz continuous gradient, then we prove that our method obtains an e-optimal 
solution in 0(N'^ /e) iterations, where A'' is the number of blocks. For the class of 
problems with cheap coordinate derivatives we show that the new method is faster 
than methods based on full-gradient information. Analysis for the rate of conver- 
gence in probability is also provided. For strongly convex functions our method 
converges linearly. Extensive numerical tests confirm that on very large problems, 
our method is much more numerically efficient than methods based on full gradient 
information. 

Keywords Coordinate descent ■ composite objective function • linearly coupled 
constraints ■ randomized algorithms ■ convergence rate 0{l/e). 



1 Introduction 

The basic problem of interest in this paper is the following convex minimization 
problem with composite objective function: 

minF(a;) {:= f{x) + h{x)) 

(1) 

s.t.: a a; = 0, 

where / : R" ^ M is a smooth convex function defined by a black-box oracle, 
h : R" — >■ M is a general closed convex function and a G R". Further, we assume 
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that function h is coordinatewise separable and simple (by simple we mean that 
we can find a closed-form solution for the minimization of h with some simple 
auxiliary function). Special cases of this model include linearly constrained smooth 
optimization (where h = 0) which was analyzed in [161131] , support vector machines 
(where h is the indicator function of some box constraint set) [101114] and composite 
optimization (where a = 0) [23ll28l[29ll30] . 

Linearly constrained optimization problems with composite objective function 
arise in many applications such as compressive sensing [S], image processing [S], 
truss topology design [T7] , distributed control [T5] , support vector machines [29] , 
traffic equilibrium and network flow problems [3] and many other areas. For prob- 
lems of moderate size there exist many iterative algorithms such as Newton, quasi- 
Newton or projected gradient methods [8l[9lll3]. However, the problems that we 
consider in this paper have the following features: the dimension of the optimization 
variables is very large such that usual methods based on full gradient computations 
are prohibitive. Moreover, the incomplete structure of information that may appear 
when the data are distributed in space and time, or when there exists lack of 
physical memory and enormous complexity of the gradient update can also be an 
obstacle for full gradient computations. In this case, it appears that a reasonable 
approach to solving problem ([T]) is to use (block) coordinate descent methods. These 
methods were among the first optimization methods studied in literature |3j . The 
main differences between all variants of coordinate descent methods consist of the 
criterion of choosing at each iteration the coordinate over which we minimize our 
objective function and the complexity of this choice. Two classical criteria, used 
often in these algorithms, are the cyclic and the greedy (e.g., Gauss-Southwell) co- 
ordinate search, which significantly differ by the amount of computations required 
to choose the appropriate index. The rate of convergence of cyclic coordinate 
search methods has been determined recently in \I\\27\ . Also, for coordinate de- 
scent methods based on the Gauss-Southwell rule, the convergence rate is given in 
|28ll29l[30] . Another interesting approach is based on random coordinate descent, 
where the coordinate search is random. Recent complexity results on random co- 
ordinate descent methods were obtained by Nesterov in [20] for smooth convex 
functions. The extension to composite objective functions was given in [231124] and 
for the grouped Lasso problem in [22]. However, all these papers studied optimiza- 
tion models where the constraint set is decoupled (i.e., characterized by Cartesian 
product). The rate analysis of a random coordinate descent method for linearly 
coupled constrained optimization problems with smooth objective function was 
developed in [16] . 

In this paper we present a random coordinate descent method suited for large 
scale problems with composite objective function. Moreover, in our paper we focus 
on linearly coupled constrained optimization problems (i.e., the constraint set is 
coupled through linear equalities) . Note that the model considered in this paper is 
more general than the one from [16] . since we allow composite objective functions. 
We prove for our method an expected convergence rate of order C'(-^), where 
N is number of blocks and k is the iteration counter. We show that for functions 
with cheap coordinate derivatives the new method is much faster, either in worst 
case complexity analysis, or numerical implementation, than schemes based on 
full gradient information (e.g., coordinate gradient descent method developed in 
[30]*). But our method also offers other important advantages, e.g., due to the ran- 
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domization, our algorithm is easier to analyze and implement, it leads to more 
robust output and is adequate for modern computational architectures (e.g, par- 
allel or distributed architectures). Analysis for rate of convergence in probability 
is also provided. For strongly convex functions we prove that the new method con- 
verges linearly. We also provide extensive numerical simulations and compare our 
algorithm against state-of-the-art methods from the literature on three large-scale 
applications: support vector machine, the Chebyshev center of a set of points and 
random generated optimization problems with an ^i-regularization term. 

The paper is organized as follows. In order to present our main results, we 
introduce some notations and assumptions for problem ([l]) in Section ll.il In Sec- 
tion[2]we present the new random coordinate descent (RCD) algorithm. The main 
results of the paper can be found in Section [3l where we derive the rate of con- 
vergence in expectation, probability and for the strongly convex case. In Section 
in we generalize the algorithm and extend the previous results to a more general 
model. We also analyze its complexity and compare it with other methods from 
the literature, in particular the coordinate descent method of Tseng [30j in Sec- 
tion [5] Finally, we test the practical efficiency of our algorithm through extensive 
numerical experiments in Section [6l 



1.1 Preliminaries 

We work in the space R" composed of column vectors. For x,y £ R" we denote: 

n 

1=1 

We use the same notation (•, ■) for spaces of different dimensions. If we fix a norm 
II -11 in R", then its dual norm is defined by: 

\\y\\* = max {y,x). 

\\x\\ = l 

We assume that the entire space dimension is decomposable into A'' blocks: 

N 



i=l 



We denote by Ui the blocks of the identity matrix: 

In = [Ui ... Un] , 

where Ui € R"^"'. For some vector x £ R", we use the notation Xi for the ith 
block in x, \7if{x) = UfV f{x) is the ith block in the gradient of the function 

coordinates in x. Given a matrix A G R™'*", we denote its nuUspace by Null{A). 
In the rest of the paper we consider local Euclidean norms in all spaces R"*, i.e., 
\\xi\\ = ^{xi)'^Xi for all Xi e R"' and i = I, . . . ,N . 
For model ^ we make the following assumptions: 



/ at X, and Vijf{x) 



We denote by supp{x) the number of nonzero 
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Assumption 1 The smooth and nonsmooth parts of the objective function in opti- 
mization model (m satisfy the following properties: 

(i) Function f is convex and has block- coordinate Lipschitz continuous gradient: 

\\VJ{x + U,h,)-\/J{x)\\<L,\\h,\\ Vx-gR", /t, gM"% i = l,...,N. 
(ii) The nonsmooth function h is convex and coordinatewise separable. 

Assumption [1] (i) is typical for composite optimization, see e.g., [191130] . Assump- 
tion [T] (ii) covers many applications as we further exemplify. A special case of 
coordinatewise separable function that has attracted a lot of attention in the area 
of signal processing and data mining is the i?i-regularization [22] : 

h{x) = \\\x\\^, (2) 

where A > 0. Often, a large A factor induces sparsity in the solution of optimiza- 
tion problem 1^. Note that the function h in ^ belongs to the general class of 
coordinatewise separable piecewise linear/quadratic functions with 0(1) pieces. 
Another special case is the box indicator function, i.e.: 

f,^ ^ 1 io, I < X < u 

I oo, otherwise. 

Adding box constraints to a quadratic objective function / in ^ leads e.g., to 
support vector machine (SVM) problems [71129] . The reader can easily find many 
other examples of function h satisfying Assumption [1] (ii). 
Based on Assumption 1 (i) , the following inequality can be derived [18] : 

f{x + U,K)<f{x) + {V,f{x),h,) + ^\\h£ Vx^gM", fe, gR"'. (4) 
In the sequel, we use the notation: 

L = max Li. 

l<i<N 

For a G [0, 1] we introduce the extended norm on R" similar as in [20] : 

/ N 



N 
i=l 



and its dual norm 



Note that these norms satisfy the Cauchy-Schwartz inequality: 

ll^llall?/|i:>(^,J/> Vx,yGR". 

For a simpler exposition we use a context-dependent notation as follows: let x G R" 
such that X = YliLi^iXi, then Xij G R"'+"j denotes a two component vector 



Xij — 

i.e., y + Xij, we understand y + UiXi + UjXj. Also, by inner product {y,Xij) with 
vectors y from the extended space R" we understand {y,Xij) = {yi,Xi) + {yj,Xj). 
Based on Assumption [T](ii) we can derive from the following result: 



Moreover, by addition with a vector in the extended space y G 
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Lemma 1 Let function f be convex and satisfy Assumption\l\ Then, the function f 
has componentwise Lipschitz continuous gradient w.r.t. every pair i.e.: 



V,f{x+U,s,+UjSj) 
\7jfix+U^s, + UjSj) 



\/x e : 



, 8, e 



where we define if- = L] " + L^J 



Proof Let /* = min f{x). Based on Q we have for any pair 

fix) -f*> max \\Vifix)f > max \\Vif{x)f 

ie{i....N} 2Li ie{i,j} ^^i 

7^llV,/(x)f + -i^l|V,/(x)"' 



Viiv.,/(-)ii:': 



where in the third inequality we used that aa-\-{\—a)b < max{a, b} for all a £ [0, 1]. 
Now, note that the function giiy^j) = f{x + y.^ - a;.y) - f{x) - {'Vf{x),y^j - Xij) 
satisfies the Assumption [1] (i). If we apply the above inequality to gi{yij) we get 
the following relation: 

f{x + yij-Xij) > f{x) + {Vf{x),yij-Xij) + -^ \\^ijfix + yij - Xij) -Vijf{x)\\*\ 

On the other hand, applying the same inequality to g2{xij) = f{x) — f{x + j/y — 
Xij) + {\7f{x + yij — Xij),yij — xjj), which also satisfies Assumption [1] (i), we have: 

f{x) > f{x + yij - x^j) + (Vfix+yij - Xij),yij ~ x,j} + 

1 II 11*2 

^ijf{x + yij - Xij) ~\/ijf{x)\\ . 



2L" 

Further, denoting Sij = yij — Xij and adding up the resulting inequalities we get: 
11^^/(2: + s,j) - Vij/(a;)||*^ < {Vijf{x + s.^) - Vijf{x),Sij) 

< l|Vjj/(a: + Sy)- Vij/(a;)[|* \\sij\\^, 
for aU x€W and G , which proves the statement of this lemma. □ 

It is straightforward to see that we can obtain from Lemma [1] the following in- 
equality (see also [18]): 

fix + s,,) < fix) + (y,,fix),s,,) + ^ \\s,,\\l , (5) 
for aU X e R", s„- G K"'+"j and a G [0, 1]. 
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2 Random coordinate descent algorithm 

In this section we introduce a variant of Random Coordinate Descent (RCD) 
method for solving problem ([T|) that performs a minimization step with respect to 
two block variables at each iteration. The coupling constraint (that is, the weighted 
sum constraint a^x = 0) prevents the development of an algorithm that performs a 
minimization with respect to only one variable at each iteration. We will therefore 
be interested in the restriction of the objective function / on feasible directions 
consisting of at least two nonzero (block) components. 

Let be a two dimensional random variable, where i,j G {1, . . . , A*'} with i ^ 
j and Pi^j^. = Pr((i,_j) = {ik,jk)) be its probability distribution. Given a feasible 
X, two blocks are chosen randomly with respect to a probability distribution pij 
and a quadratic model derived from the composite objective function is minimized 
with respect to these coordinates. Our method has the following iteration: given 
a feasible initial point a;", that is a^x'^ = 0, then for all fc > 

Algorithm 1 (RCD) 

1. Choose randomly two coordinates {ik,3k) with probability pj^.j,. 

2. Set x'^^^ = x'^ + Ui^di,^ + Uj^dj^, 

where the directions di^ and dj^ are chosen as follows: if we use for simplicity the 
notation (i,j) instead of {ik,jk)j the direction dij = [df dj]'^ is given by 

dy=arg min f{x'') + {Vijf{x''),Sij) + -^\\sij\\ +h{x'' + Sij) 

Sij GK"*+"J ^ 

s.t.: aj Sj + aj Sj = 0, 

where e M"' and aj e M"^ are the ith and jth blocks of vector a, respectively. 
Clearly, in our algorithm we maintain feasibility at each iteration, i.e. a'^x'^ = 
for all fc > 0. 

Remark 1 

(i) Note that for the scalar case (i.e., N = n) and h given by ^ or the 
direction dij in ^ can be computed in closed form. For the block case (i.e., 
Tii > 1 for all i) and if /i is a coordinatewise separable, strictly convex and piece- 
wise linear/quadratic function with 0(1) pieces (e.g., h given by there 
are algorithms for solving the above subproblem in linear-time (i.e., 0{ni +nj) 
operations) [S^ . Also for h given by ^ , there exist in the literature algorithms 
for solving the subproblem ([6]) with overall complexity 0{ni -\- rij) [2lll2]. 

(ii) In algorithm (RCD) we consider (i, j) = [j, i) and i ^ j. Moreover, we know that 
the complexity of choosing randomly a pair (i, j) with a uniform probability 
distribution requires 0(1) operations. □ 

We assume that random variables {ikt jk)k>o S'i's i.i.d. In the sequel, we use nota- 
tion ■q'' for the entire history of random pair choices and (p'^ for the expected value 
of the objective function w.r.t. 77*^, i.e.: 

v'' = {(«o, jo), • • • , (ifc-i, jfc-i)} and (p'' = E^k \f{x^)\^ . 



(6) 
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2.1 Previous work 

We briefly review some weU-known methods from tlie literature for solving the 
optimization model In [28,29,30_ Tseng studied optimization problems in the 
form ([U and developed a (block) coordinate gradient descent(CGD) method based 
on the Gauss-Southwell choice rule. The main requirement for the (CCD) iteration 
is the solution of the following problem: given a feasible x and a working set of 
indexes J, the update direction is defined by 

dnix; J) = arg min/(a:) + {Vf{x),s) + l-{Hs,s) + h(x + s) 
s.t.: a^s = 0, Sj =0 V j ^ J, 

where H G E"^" is a symmetric matrix chosen at the initial step of the algorithm. 

Algorithm (CGD): 

1. Choose a nonempty set of indices J''^ C {1, . . . , n} with respect to the 
Gauss-Southwell rule 

2. Solve © with x = x'' , J = , H = Hk to obtain d'' = dn^ {,x^\ J^) 

3. Choose stepsize a'' > and set x'^^^ = x'' + a'd'^. 

In [30], the authors proved for the particular case when function h is piece-wise 
linear/quadratic with 0(1) pieces that an e-optimal solution is attained in 0{ "^^^i ) 
iterations, where _Ro denotes the Euclidean distance from the initial point to some 
optimal solution. Also, in [30] the authors derive estimates of order 0{n) on the 
computational complexity of each iteration for this choice of h. 

Furthermore, for a quadratic function / and a box indicator function h (e.g., 
support vector machine (SVM) applications) one of the first decomposition ap- 
proaches developed similar to (RCD) is Sequential Minimal Optimization (SMO) 
[21| . SMO consists of choosing at each iteration two scalar coordinates with re- 
spect to some heuristic rule based on KKT conditions and solving the small QP 
subproblem obtained through the decomposition process. However, the rate of con- 
vergence is not provided for the SMO algorithm. But the numerical experiments 
show that the method is very efficient in practice due to the closed form solution 
of the QP subproblem. List and Simon [Tl] proposed a variant of block coordinate 

descent method for which an arithmetic complexity of order 0( " ^ ° ) is proved 
on a quadratic model with a box indicator function and generalized linear con- 
straints. Later, Hush et al. [10] presented a more practical decomposition method 
which attains the same complexity as the previous methods. 

A random coordinate descent algorithm for model ([1]) with a = and h being 
the indicator function for a Cartesian product of sets was analyzed by Nesterov in 
[20| . The generalization of this algorithm to composite objective functions has been 
studied in [221123] . However, none of these papers studied the application of coor- 
dinate descent algorithms to linearly coupled constrained optimization models. A 
similar random coordinate descent algorithm as the (RCD) method described in 
the present paper, for optimization problems with smooth objective and linearly 
coupled constraints, has been developed and analyzed by Necoara et al. in [16] . 
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We further extend these results to Unearly constrained composite objective func- 
tion optimization and provide in the sequel the convergence rate analysis for the 
previously presented variant of the (RCD) method (see Algorithm 1 (RCD)). 

3 Convergence results 

In the following subsections we derive the convergence rate of Algorithm 1 (RCD) 
for composite optimization model ([TJ in expectation, probability and for strongly 
convex functions. 

3.1 Convergence in expectation 

In this section we study the rate of convergence in expectation of algorithm (RCD). 
We consider uniform probability distribution, i.e., the event of choosing a pair (i, j) 
can occur with probability: 

_ 2 

~ N{N-l)' 

since we assume that (i, j) = (j, i) and i ^ j e {1, . . . , A'^} (see Remark[l] (ii)). In 
order to provide the convergence rate of our algorithm, first we have to define the 
conformal realization of a vector introduced in [251126] . 

Definition 1 Let d, d! £ R", then the vector d! is conformal to d if: 

3upp{d') C supped) and d'jdj > Vj = 1, . . . ,n. 

For a given matrix A £ R™^", with m < n, we introduce the notion of elementary 
vectors defined as: 

Definition 2 An elementary vector of Null{A) is a vector d e Null{A) for which 
there is no nonzero vector d' £ Null{A) conformal to d and supp{d') ^ supped). 

Based on Exercise 10.6 in [26] we state the following lemma: 

Lemma 2 \26^ Given d £ NuU{A), if d is an elementary vector, then \supp{d)\ < 
rank{A) + 1 < m + 1. Otherwise, d has a conformal realization: 

d = d^ + hd^ 

where s > 1 and £ Null[A) are elementary vectors conformal to d for allt = 1, . . . , s. 

For the scalar case, i.e., N = n and ?n = 1, the method provided in |30] finds a 
conformal realization with dimension s < |supp(d)| — 1 within 0{n) operations. We 
observe that elementary vectors d* in Lemma [2] for the case m = 1 (i.e., A = a^) 
have at most 2 nonzero components. 

Our convergence analysis is based on the following lemma, whose proof can be 
found in [301 Lemma 6.1]: 

Lemma 3 \30^ Let h be coordinatewise separable and convex. For any x, x + d £ dom h, 
let d be expressed as d = d^ + ■ ■ ■ + d'^ for some s > 1 and some nonzero d* £ R" 
conformal to d for t = 1, . . . , s. Then, we have: 

s 

h{x + d) - h{x) > (ji{x + d*) - h{x)^ . 
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For the simplicity of the analysis we introduce the following linear subspaces: 



and 5* : 



de 



a^d = o|. 



A simplified update rule of algorithm (RCD) is expressed as: 

= X -\- Uidj + Ujdj. 

We denote by F* and X* the optimal value and the optimal solution set for 
problem ([1]), respectively. We also introduce the maximal residual defined in terms 
of the norm 11 ■ |L: 



Ra 



max < max \\x ~ x 
X I x*ex* " 



F{x) < F{x°) , 



which measures the size of the level set of F given by x'^ . We assume that this 
distance is finite for the initial iterate 2;°. 
Now, we prove the main result of this section: 

Theorem 1 Let F satisfy Assumption\l\ Then, the random coordinate descent algo- 
rithm (RCD) based on the uniform distribution generates a sequence x^ satisfying the 
following convergence rate for the expected values of the objective function: 



k + 



F{x")-F'- 



Proof For simplicity, we drop the index k and use instead of {ik,jk) s-i^d x^ the 
notation and x, respectively. Based on ([5]) we derive: 



F{x+) < fix) + {V,,f{x),d,,) + -f- \\d,,\\^ + h{x + d,j) 



= min f{x) + {Vijf{x),s,j) + ^\\si 



+ h{x + Sij). 



Taking expectation in both sides w.r.t. random variable and recalling that 

Pij = N(N-i) ^ we get: 



E^J [F{x+)] < E,. 



< E, . 



min fix) +{Vijf{x),Sij) + 
f{x) + {V,,f{x),s,,) + ^\\s,j 



— "sijlP + h{x + Sjj) 



+ h{x + sij) 



N{N-1)- 



(8) 



for all possible Sij G 5^^ and pairs with i ^ j € {1, . . . ,N}. 

Based on Lemma[2]for m = 1, it follows that any d G S has a conformal realization 

defined by d = X] d*, where the vectors d* £ 5 are conformal to d and have only 
t=i 
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two nonzero components. Thus, for any t = I, . . . , s there is a pair such that 
d* e Sij. Therefore, for any d £ S we can choose an appropriate set of pairs 
and vectors sfj G Sij conformal to d such that d = Y^sfj. As we have seen, the 

hi 

above chain of relations in ^ holds for any set of pairs (i, j) and vectors Sij e Sij. 

Therefore, it also holds for the set of pairs and vectors sfj such that d = J2 ^fj- 

id 

In conclusion, we have from (|8]) that: 



E., [F{x+)]<f{x)+ ^ ^_ (V/(x),^4>+^^||4|[+^M. + 4) , 

\ i-3 i-j i,j I 



for all d £ S. Moreover, observing that Lfj < 2L^ " and applying Lemma [3] in 
the previous inequality for coordinatewise separable functions ||-||^ and h{-), we 
obtain: 

E.j[F{x+)] </(x) + ^^^-^((V/(^),^4)+^ :^||4||2+^;,(^+4)) 

Lemma 3 9 ^ — - > ^ ^ — ^ j o 

- ^(^'^ + N{N- 1) E 4) + ^'""iiE 4 II' + 



2 



' 2 , 2 



Li-"||d||^ + M:c + d)), 

for any d £ S. Note that ([9]) holds for every d e S since ([8]) holds for any s^j G . 
Therefore, as ^ holds for every vector from the subspace S, it also holds for the 
following particular vector d £ S defined as: 

d = argmin/(a;) + {Vf{x),s) + L^-" \\s\\l + h{x + s). 
s£S 

Based on this choice and using similar reasoning as in [191123] for proving the 
convergence rate of gradient type methods for composite objective functions, we 
derive the following: 

fix) + {S/f{x), d) + L'-" \\d\\l + h{x + d) 

= minf{x) + (V/(x),y - x) + i^"" \\y - x\\l + h{y) 
v&s 

< minF{y) + L^~" \\y - x\\l 
yes 

< min F(/3x* + l3)x)+l3^L^~"\\x-x*\\'^ 

< uiin F(x) - /3(F(x) - F*) + IJ^L^'^rI 



*\2 



F{x) 



L^-'^Rl 
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where in the first inequality we used the convexity of / while in the second and third 
inequalities we used basic optimization arguments. Therefore, at each iteration k 
the following inequality holds: 



^.u. [n^"^')] <(i 



2 )F{x'') + 



N{N - 1) 



N{N - 1) 



{F{x^)-F*f 



Taking expectation with respect to r;^. and using convexity properties we get: 



fk+l 



F* <(1 



N{N - 1) 



N{N - 1) 



Fr 



F*) 



N{N - 1) 



(10) 



Further, if we denote A 



- F* and 7 = N{N ~ l)L^-°'Rl we get: 
(A'^f 



^k+i < ^fc _ 
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Dividing both sides with A''A''+^ > and using the fact that Z\''+^ < A'' we get: 

Finally, summing up from 0, . . . , fc we easily get the above convergence rate. □ 

Let us analyze the convergence rate of our method for the two most common 
cases of the extended norm introduced in this section: w.r.t. extended Euclidean 
norm (i.e., a = 0) and norm ||-||j^ (i-e., a = 1). Recall that the norm ||-||-^ is 
defined by: 

N 

\\x\\l=J2L^ 
i=l 

Corrollary 1 Under the same assumptions of Theorem\T\ the algorithm (RCD) gen- 
erates a sequence such that the expected values of the objective function satisfy the 
following convergence rates for a = and a = 1: 



a = 



a=l 



F < 



N^LRl 



k + 



F* < 



k + 



F{x")-F' 
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Remark 2 We usually have J?f < LR^ and this shows the advantages that the 
general norm has over the Euclidean norm. Indeed, if we denote by = 
maxx{max3,*gx* W^i — '■ F(x) < F{x°)}, then we can provide upper bounds 
on Ri < X]i=i ^i^i S'lid ^0 ^ X]i=i ''"i- Clearly, the following inequality is valid: 

N N 
'^L.rf < '^Lrf, 

1=1 i=i 

and the inequality holds with equality only for = L for all i = 1, . . . , A*'. We 
recall that L = max^Lj. Therefore, in the majority of cases the estimate for the 
rate of convergence based on norm is much better than that based on the 
Euclidean norm ||-||q. 



3.2 Convergence for strongly convex functions 

Now, we assume that the objective function in ([!]) is o-Q-strongly convex with 
respect to norm i.e.: 

F{x)>F{y) + {F'{y),x~y) + ^\\x-y\\l Vx,yeR", (11) 

where F' {y) denotes some subgradient of F at y. Note that if the function / is a- 
strongly convex w.r.t. extended Euclidean norm, then we can remark that it is also 
CTQ-strongly convex function w.r.t. norm ||-||^ and the following relation between 
the strong convexity constants holds: 



N N 



L 



|2 



> o-Q ||x - y||^ , 



which leads to 



0"a < — — 

- L" 



Taking y = x* in and from optimality conditions {F'{x*),x — x*) > for all 
X £ S we obtain: 

F{x)^F*>^\\x-x*\\l. (12) 

Next, we state the convergence result of our algorithm (RCD) for solving the 
problem ([T]) with o-Q-strongly convex objective w.r.t. norm 

Theorem 2 Under the assumptions of Theorem\l\ let F be also Oa-strongly convex 
w.r.t. IHIq,. For the sequence x^ generated by algorithm (RCD) we have the following 
rate of convergence of the expected values of the objective function: 



where 7 is defined by: 



F*<[1-^^^] (F(^^)-F*), 



— - — , otherwise. 
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Proof Based on relation ^ it follows that: 



2 



— min(/(^'=) + (V/(x'=), d> + L^-" \\d\\l + h{x'' + d)) . 



N{N - 1) 

Then, using similar derivation as in Theorem 1 we have: 



min/(a;'^) + (V/(a;'=),d) + i'"" + h{x'' + d) 

1-a II fell^ 

< minF(y) + L " m — 2: ' 

yes W \\a 

I 1 1 2 

< min F{l3x* + {l-l3)x'')+l3'^L^~°'\\x'' -x*\\ 



/3e[oa] 



2 r l-a fe * 
\\X — X 



< min Fix") - PiFix") - F*) + I3^L 

~ PelOS] II 

< min F(x'') +I3( MZiLJ! _ A ( F(xh - F*) 

~ /36[0,1] V / ^ ^ 



where the last inequality results from p2|l . The statement of the theorem is ob- 
tained by noting that /3* = min{l, jff^} and the following subcases: 



1. If /?* = ^^r^a and we take the expectation w.r.t. 77*^ we get: 

^'+'-^*<(l-4ZT^)(^'-^*)' (13) 

2. if /3* = 1 and we take the expectation w.r.t. 77*^ we get: 

2(1-^)" 



1 



iV2 



F*). (14) 



3.3 Convergence in probability 

Further, we establish some bounds on the required number of iterations for which 
the generated sequence x^ attains e-accuracy with prespecified probability. In order 
to prove this result we use Theorem 1 from [23] and for a clear understanding we 
present it bellow. 

Lemma 4 J23f Let > fee a constant, < e < and consider a nonnegative 
nonincreasing sequence of (discrete) random variables {C'^ }fe>o with one of the following 
properties: 

(1) £[^''+^1^''] < C*" - for all k, where c> is a constant, 

(2) ^[^''+^1^''] < (1 - i) C*" for all k such that ^'^ > e, where c> I is a constant. 
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Then, for some eonfidence level p £ (0, 1) we have in probability that: 

<e)>l-p, 
for a number K of iterations which satisfies 

^>^(l + logi)+2-|,, 



if property (1) holds, or 



K > clog 5-, 



if property (2) holds. 

Based on this lemma we can state the following rate of convergence in probability: 

Theorem 3 Let F be a Oa-strongly convex function satisfying Assumption\l\and p > 
be the confidence level. Then, the sequence generated by algorithm (RCD) using 
uniform distribution satisfies the following rate of convergence in probability of the 
expected values of the objective function: 

Pr{(t>^ - F* <e)>l - p, 



with K satisfying 



K > 



^log^^:^, a.>0, 



where 7 : 



==- , otherwise 



l-a 



Proof Based on relation (|10|) , we note that taking as = rj)^ — F* , the property 
(1) of Lemma|3]holds and thus we get the first part of our result. Relations (|13p and 
(|14p in the strongly convex case are similar instances of property (2) in Theorem 
Ufrom which we get the second part of the result. □ 



4 Generalization 

In this section we study the optimization problem but with general linearly 
coupling constraints: 

min F{x) (:= f{x) + h{x)) 
s.t.: Ax = 0, 

where the functions / and h have the same properties as in Assumption [1] and 
A G is a matrix with 1 < m < n. There are very few attempts to solve this 

problem through coordinate descent strategies and up to our knowledge the only 
complexity result can be found in [30] . 
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For the simplicity of the exposition, we work in this section with the standard 
Euclidean norm, denoted by |j-||Q, on the extended space M". We consider the set 
of all (m + l)-tuples of the form TV = (i^, . . . , 1™+^), where iP e {!,..., iV} for 
all p = 1, . . . , m + 1. Also, we define pf^ as the probability distribution associated 
with (m+ l)-tuples of the form JV. Given this probabihty distribution p^, for this 
general optimization problem (|15[1 we propose the following random coordinate 
descent algorithm: 



Algorithm 2 (RCD)aA 




1. Choose randomly a set of (m + l)-tuple A/jt = {i] 




with probability pj^^^ 




2. Set x'''^^ = x'^ + dj\j-^ , 





where the direction dj^^ is chosen as follows: 

dM, = arg mm/(a;'=) + {Vf{x''),s) + ^ \\s\\l + hix" + s) 
s.t.: As = 0, = Vi ^ A^. 

We can easily see that the linearly coupling constraints Ax = prevent the de- 
velopment of an algorithm that performs at each iteration a minimization with 
respect to less than m + 1 coordinates. Therefore we are interested in the class of 
iteration updates which restricts the objective function on feasible directions that 
consist of at least m + 1 (block) components. 

Further, we redefine the subspace S as S = {s e R" : As = 0} and additionally 
we denote the local subspace Sj\f = {s € M" : As = 0, Sj = Vi G A/"}. Note that 
we still consider an ordered (m + l)-tuple Afe = (ij., . . . , i™"*"^) such that ^ ij. for 
all p ^ I. We observe that for a general matrix A, the previous subproblem does 
not necessarily have a closed form solution. However, when h is coordinatewise 
separable, strictly convex and piece-wise linear/quadratic with 0(1) pieces (e.g., h 
given by ([2}) there are efficient algorithms for solving the previous subproblem in 
linear-time |30]. Moreover, when h is the box indicator function (i.e., h given by (|3])) 
we have the following: in the scalar case (i.e., N = n) the subproblem has a closed 
form solution; for the block case (i.e., N < n) there exist linear-time algorithms 
for solving the subproblem within Oi^^^j^^ rii) operations [2]. Through a similar 
reasoning as in Lemma[T]we can derive that given a set of indices Af = (i^, . . . , i^), 
with p > 2, the following relation holds: 

f{x + d^) < fix) + {Vf{x),d^) + ^ , (16) 

for all X G M" and dj^ £ R" with nonzero entries only on the blocks 
Here, Lf/ = Lji -!-•■• + I/i? . Moreover, based on Lemma[2]it follows that any d £ S 
has a conformal realization defined by d = X]t=i where the elementary vectors 
are conformal to d and have at most m+1 nonzeros. Therefore, any vector 
d £ S can be generated by d = Ylj\f^^' where the vectors sj^/ £ Sj^ have at most 
m+1 nonzero blocks and are conformal to d. We now present the main convergence 
result for this method. 



16 



I. Necoara, A. Patrascu 



Theorem 4 Let F satisfy Assumption Q] Then, the random coordinate descent al- 
gorithm (RCD)_\f that chooses uniformly at each iteration m + 1 blocks generates a 
sequence satisfying the following rate of convergence for the expected values of the 
objective function: 

Proof The proof is similar to that of Theorem [1] and we omit it here for brevity. 



5 Complexity analysis 

In this section we analyze the total complexity (arithmetic complexity [18]) of 
algorithm (RCD) based on extended Euclidean norm for optimization problem ([1]) 
and compare it with other complexity estimates. Tseng presented in [30] the first 
complexity bounds for the (CGD) method applied to our optimization problem 
([T|). Up to our knowledge there are no other complexity results for coordinate 
descent methods on the general optimization model ([1} . 

Note that the algorithm (RCD) has an overall complexity w.r.t. extended Eu- 
clidean norm given by: 

o[^) Oilmen), 

where 0{ifjCD) is the complexity per iteration of algorithm (RCD). On the other 
hand, algorithm (CGD) has the following complexity estimate: 

O (^) Oilcan), 

where 0{icGD) is the iteration complexity of algorithm (CGD). Based on the par- 
ticularities and computational effort of each method, we will show in the sequel 
that for some optimization models arising in real-world applications the arithmetic 
complexity of (RCD) method is lower than that of (CGD) method. For certain in- 
stances of problem ([T]) we have that the computation of the coordinate directional 
derivative of the smooth component of the objective function is much more simpler 
than the function evaluation or directional derivative along an arbitrary direction. 
Note that the iteration of algorithm (RCD) uses only a small number of coordinate 
directional derivatives of the smooth part of the objective, in contrast with the 
(CGD) iteration which requires the full gradient. Thus, we estimate the arithmetic 
complexity of these two methods applied to a class of optimization problems con- 
taining instances for which the directional derivative of objective function can be 
computed cheaply. We recall that the process of choosing a uniformly random pair 
(i,j) in our method requires 0(1) operations. 

Let us structure a general coordinate descent iteration in two phases: 
Phase 1: Gather first-order information to form a quadratic approximation of the 
original optimization problem. 

Phase 2: Solve a quadratic optimization problem using data acquired at Phase 1 
and update the current vector. 
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Both algorithms (RCD) and (CGD) share this structure but, as we will see, there 
is a gap between computational complexities. We analyze the following example: 

f{x) = ^x'^Z^Zx + q'^x, (17) 

where Z = [zi ... zn] G ]]j™x" has sparse columns, with an average p << n 
nonzero entries on each column zi for all i = 1, . . . ,n. A particular case of this 
class of functions is f{x) = ^ \\Zx — g||^, which has been considered for numerical 
experiments in |20] and [23] . The problem Q , with the aforementioned structure 
P7p of the smooth part of the objective function, arises in many applications: 
e.g., linear SVM [29], truss topology [T7], internet (Google problem) [20], Cheby- 
shev center problems [32] . etc. The reader can easily find many other examples of 
optimization problems with cheap coordinate directional derivatives. 

Further, we estimate the iteration complexity of the algorithms (RCD) and 
(CGD). Given a feasible x, from the expression 

V,f{x) = {z,,Zx)+q\ 

we note that if the residual r{x) = Zx is already known, then the computation 
of Vif{x) requires 0{p) operations. We consider that the dimension rii of each 
block is of order 0{^). Thus, the (RCD) method updates the current point x 
on 0{^) coordinates and summing up with the computation of the new residual 
r{x'^) = Zx'^, which in this case requires 0{^) operations, we conclude that up 
to this stage, the iteration of (RCD) method has numerical complexity C'(^). 
However, the (CGD) method requires the computation of the full gradient for 
which are necessary 0{np) operations. As a preliminary conclusion. Phase 1 has 
the following complexity regarding the two algorithms: 

(RCD). Phase 1 : 0{^) 
(CGD). Phase 1 : 0{np) 

Suppose now that for a given x, the blocks {\/if{x),Vjf{x)) are known for 
(RCD) method or the entire gradient vector V/(x) is available for (CGD) method 
within previous computed complexities, then the second phase requires the find- 
ing of an update direction with respect to each method. For the general linearly 
constrained model ([TJ , evaluating the iteration complexity of both algorithms can 
be a difficult task. Since in [30] Tseng provided an explicit total computational 
complexity for the cases when the nonsmooth part of the objective function h is 
separable and piece-wise linear/quadratic with 0(1) pieces, for clarity of the com- 
parison we also analyze the particular setting when /i is a box indicator function 
as given in equation (|3]). For algorithm (RCD) with a = 0, at each iteration, we 
require the solution of the following problem (see (O): 

min {yijf{x),Sij} + ^ 

s.t.: flj Si + aj Sj = 0, [I — x 

It is shown in [12] that problem psp can be solved in 0{ni + rij) operations. 
However, in the scalar case (i.e., N = n) problem (|18p can solved in closed form. 



llo 



(18) 
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Therefore, Phase 2 of algorithm (RCD) requires 0{j^) operations. Finally, we 
estimate for algorithm (RCD) the total arithmetic complexity in terms of the 
number of blocks A'' as: 

On the other hand, due to the Gauss-Southwell rule, the (CGD) method re- 
quires at each iteration the solution of a quadratic knapsack problem of dimension 
n. It is argued in [12] that for solving the quadratic knapsack problem we need 0{n) 
operations. In conclusion, the Gauss-Southwell procedure in algorithm (CGD) re- 
quires the conformal realization of the solution of a continuous knapsack problem 
and the selection of a "good" set of blocks J'. This last process has a different cost 
depending on m. Overall, we estimate the total complexity of algorithm (CGD) 
for one equality constraint, m = 1, as: 

O (^) Oi,n) 

Table 1: Comparison of arithmetic complexities for alg. (RCD), (CGD) and [101 
[Tl] for m = 1. 



Algorithm / m = 1 


h{x) 


Probabilities 


Complexity 


(RCD) 


separable 


1 


^^pNnLR^ -J 


(CGD) 


separable 


greedy 




Hush (To], List [14] 


box indieator 


greedy 





First, we note that in the case m = 1 and N << n (i.e., the block case) 
algorithm (RCD) has better arithmetic complexity than algorithm (CGD) and 
previously mentioned block-coordinate methods [lOIITl] (see Table 1). When m = 
1 and N = n (i.e., the scalar case), by substitution in the above expressions 
from Table 1, we have a total complexity for algorithm (RCD) comparable to the 
complexity of algorithm (CGD) and the algorithms from [101114] . 

On the other hand, the complexity of choosing a random pair in algorithm 
(RCD) is very low, i.e., we need 0(1) operations. Thus, choosing the working pair 
(i,j) in our algorithm (RCD) is much simpler than choosing the working set J 
within the Gauss-Southwell rule for algorithm (CGD) which assumes the following 
steps: first, compute the projected gradient direction and second, find the confor- 
mal realization of computed direction; the overall complexity of these two steps 
being 0{n). In conclusion, the algorithm (RCD) has a very simple implementation 
due to simplicity of the random choice for the working pair and a low complexity 
per iteration. 

For the case m = 2 the algorithm (RCD) needs in Phase 1 to compute coor- 
dinate directional derivatives with complexity 0{^) and in Phase 2 to find the 
solution of a 3-block dimensional problem of the same structure as p8|l with com- 
plexity 0{^). Therefore, the iteration complexity of the (RCD) method in this 
case is still C'(^). On the other hand, the iteration complexity of the algorithm 
(CGD) for m = 2 is given by 0{pn -h nlogn) [30]. 
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For m > 2, the complexity of Phase 1 at each iteration of our method stiU 
requires 0{^) operations and the complexity of Phase 2 is 0{^j^), while in the 
(CGD) method the iteration complexity is C'(m^n^) [3Uj . 

For the case m > 1, a comparison between arithmetic complexities of algo- 
rithms (RCD) and (CGD) is provided in Table 2. We see from this table that de- 
pending on the values of n, m and N , the arithmetic complexity of (RCD) method 
can be better or worse than that of the (CGD) method. 

Table 2: Comparison of arithmetic complexities for algorithms (RCD) and (CGD) 
for m > 2. 



Algorithm 


m = 2 


m > 2 


(RCD) 




(p+m)JV™nLiij^ 


e 


e 


(CGD) 


(p+log n)ii? LR^ 


m^n^ LRq 


e 


e 



We conclude from the rate of convergence and the previous complexity analysis 
that algorithm (RCD) is easier to be implemented and analyzed due to the ran- 
domization and the typically very simple iteration. Moreover, on certain classes 
of problems with sparsity structure, that appear frequently in many large-scale 
real applications, the arithmetic complexity of (RCD) method is better than that 
of some well-known methods from the literature. All these arguments make the 
algorithm (RCD) to be competitive in the composite optimization framework. 
Moreover, the (RCD) method is suited for recently developed computational ar- 
chitectures (e.g., distributed or parallel architectures). 



6 Numerical Experiments 

In this section we present extensive numerical simulations, where we compare our 
algorithm (RCD) with some recently developed state-of-the-art algorithms from 
the literature for solving the optimization problem ([T]) : coordinate gradient descent 
(CGD) [30], projected gradient method for composite optimization (GM) [19] and 
LIBSVM [7|. We tested the four methods on large-scale optimization problems 
ranging from n = 10^ to n = 10^ arising in various applications such as: sup- 
port vector machine (SVM) (Section 6.1), the Chebyshev center of a set of points 
(Section 6.2) and random generated problems with an £i-regularization term (Sec- 
tion 6.3). Firstly, for the SVM application, we compare algorithm (RCD) against 
(CGD) and LIBSVM and we remark that our algorithm has the best performance 
on large-scale problem instances with sparse data. Secondly, we also observe a more 
robust behavior for algorithm (RCD) in comparison with algorithms (CGD) and 
(GM) when using different initial points on Chebyshev center problem instances. 
Lastly, we tested our algorithm on randomly generated problems, where the non- 
smooth part of the objective function contains an ^i-norm term, i.e., ^J2i=i 
for some A > 0, and we compared our method with algorithms (CGD) and (GM). 
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We have implemented all the algorithms in C-code and the experiments were 
run on a PC with an Intel Xeon E5410 CPU and 8 GB RAM memory. In all algo- 
rithms we considered the scalar case, i.e., N = n and we worked with the extended 
Euclidean norm (a = 0). In our applications the smooth part / of the composite 
objective function is of the form (|17|) . The coordinate directional derivative at the 
current point for algorithm (RCD) Vif{x) = {zi,Zx) + g; is computed efficiently 
by knowing at each iteration the residual r{x) = Zx. For the (CCD) method, 
the working set is chosen accordingly to Section 6 in [29]. Therefore, the entire 
gradient at the current point, V/(a-) = Zx + q, is required, which is computed 
efficiently using the residual r(a;) = Zx. For gradient and residual computations 
we used an efficient sparse matrix-vector multiplication procedure. We coded the 
standard (CCD) method presented in [30] and we have not used any heuristics rec- 
ommended by Tseng in [29], e.g., the "3-pair" heuristic technique. The direction 
dij at the current point from subproblem ([6]) for algorithm (RCD) is computed 
in closed form for all three applications considered in this section. For computing 
the direction d}i{x\J) at the current point from subproblem ([Tj) in the (CGD) 
method for the first two applications we coded the algorithm from [12] for solving 
quadratic knapsack problems of the form (|18p that has linear time complexity. 
For the second application, the direction at the current point for algorithm (CM) 
is computed using a linear time simplex projection algorithm introduced in [llj . 
For the third application, we used the equivalent formulation of the subproblem 
(|7]) given in [SD], obtaining for both algorithms (CGD) and (GM) an iteration 
which requires the solution of some double size quadratic knapsack problem of the 
form (fTSll . 

In the following tables we present for each algorithm the final objective function 
value (obj), the number of iterations (iter) and the necessary CPU time for our 
computer to execute all the iterations. As the algorithms (CGD), LIBSVM and 
(GM) use the whole gradient information to obtain the working set and to find 
the direction at the current point, we also report for the algorithm (RCD) the 
equivalent number of full-iterations which means the total number of iterations 
divided by ^ (i.e., the number of iterations groups . . . ,a;'^"/^). 



6.1 Support vector machine 

In order to better understand the practical performance of our method, we have 
tested the algorithms (RCD), (CGD) and LIBSVM on two-class data classification 
problems with linear kernel, which is a well-known real-world application that can 
be posed as a large-scale optimization problem in the form (|T]) with a sparsity 
structure. In this section, we describe our implementation of algorithms (RCD), 
(CGD) [22] and LIBSVM [7] and report the numerical results on different test 
problems. Note that linear SVM is a technique mainly used for text classification, 
which can be formulated as the following optimization problem: 

min ^x'^ z'^ Zx — X -\- l\n r^ix) 
s.t.: X = 0, 

where l[o,c] is the indicator function for the box constraint set [0,C]", Z £ j^^x" 
is the instance matrix with an average sparsity degree p (i.e., on average, Z has 
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p nonzero entries on each column), a G R" is the label vector of instances, C 
is the penalty parameter and e = [1 . . . 1]"^ G M". Clearly, this model fits the 
aforementioned class of functions (|17p . We set the primal penalty parameter C = 1 
in all SVM test problems. As in [29], we initialize all the algorithms with = 0. 
The stopping criterion used in the algorithm (RCD) is: f{x^~^) — f{x^~-'^^) < e, 
where j = 0, . . . , 10, while for the algorithm (CGD) we use the stopping criterion 
fix*") - f{x''+^) < e, where e = IQ-^. 



Table 3: Comparison of algorithms (RCD), (CGD) and library LIBSVM on SVM 
problems. 



Data 


rr/m 


(RCD) 


(CGD) 


LIBSVM 






fuIl-itcr/obj/timG(min) 


itor/obj/timc(min) 


itcr/obj/timo(min) 




16100/122 
(P = 14) 


11242/-5698.02/2.5 


23800/-5698.25/21.5 


63889/-5699.25/0.46 


a8a 


22696/123 
(P = 14) 


2227S/-S061. 9/18.1 


3742S/-S061.9/27.8 


94877/-8062.4/1.42 


a9a 


32561/123 
(P = 14) 


15355/- 11431. 47/7. 01 


45000/-11431. 58/89 


78244/-11433.0/2.33 


w8a 


49749/300 
(P = 12) 


15380/-1486.3/26.3 


19421/-14S6.3/27.2 


130294/-1486.8/42.9 


ijcnnl 


49990/22 
(P = 13) 


7601/-S5S9.05/6.01 


9000/-S5S9.52/16.5 


15696/-8590. 15/1.0 


web 


350000/254 
(P = 85) 


1428/-69471.21/29.95 


13600/-272flfl. 68/748 


59760/-69449. 56/467 


covtyp 


581012/54 
(P = 12) 


1722/-337798.34/38.5 


12000/- 24000/480 


466209/- 
337953.02/566.5 


tcatl 


2.2 ■ 10^/10^ 
(p = 50) 


22S/-1654. 72/51 


4600/-473. 93/568 




tcat2 


10'''/5 ■ 10-* 
(P = 10) 


350/-508. 06/112. 65 


502/-507. 59/516. 66 





We report in Table 3 the resuhs for algorithms (RCD), (CGD) and LIBSVM 
implemented in the scalar case, i.e., N = n. The data used for the experiments can 
be found on the LIBSVM webpage ( http://www.csie. ntu.edu.tw/cjlin/libsvmtools/j 
datasets/). For problems with very large dimensions, we generated the data ran- 
domly (see "testl" and "test2") such that the nonzero elements of Z fit into the 
available memory of our computer. For each algorithm we present the final objec- 
tive function value (obj), the number of iterations (iter) and the necessary CPU 
time (in minutes) for our computer to execute all the iterations. For the algorithm 
(RCD) we report the equivalent number of full-iterations, that is the number 
of iterations groups a;°, a;"/^, . . . , a:*"'"/^. On small test problems we observe that 
LIBSVM outperforms algorithms (RCD) and (CGD), but we still have that the 
CPU time for algorithm (RCD) does not exceed 30 min, while algorithm (CGD) 
performs much worse. On the other hand, on large-scale problems the algorithm 
(RCD) has the best behavior among the three tested algorithms (within a factor 
of 10). For very large problems {n > 10®), LIBSVM has not returned any result 
within 10 hours. 
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Fig. 1: Performance of algorithm (RCD) for different block dimensions. 




For the block case (i.e., N < n), we have plotted for algorithm (RCD) on the 
test problem "a7a" the CPU time and total time (in minutes) to solve knapsack 
problems (left) and the number of full-iterations (right) for different dimensions of 
the blocks rii. We see that the number of iterations decreases with the increasing 
dimension of the blocks, while the CPU time increases w.r.t. the scalar case due 
to the fact that for > 1 the direction dij cannot be computed in closed form as 
in the scalar case (i.e., = 1), but requires solving a quadratic knapsack problem 
(|18|) whose solution can be computed in 0{ni + ?ij) operations |12| . 



6.2 Chebyshev center of a set of points 

Many real applications such as location planning of shared facilities, pattern recog- 
nition, protein analysis, mechanical engineering and computer graphics (see e.g., 
|32j for more details and appropriate references) can be formulated as finding the 
Chebyshev center of a given set of points. The Chebyshev center problem involves 
the following: given a set of points zi, . . . ,Zn G K™, find the center Zc and radius r 
of the smallest enclosing ball of the given points. This geometric problem can be 
formulated as the following optimization problem: 

min r 

s.t.: Il^i — Zc||^ < r Vi = l,...,n, 

where r is the radius and zc is the center of the enclosing ball. It can be immediately 
seen that the dual formulation of this problem is a particular case of our linearly 
constrained optimization model ([T|l: 

n 

min \\Zxf - V x, + l[o,oo)(a;) (20) 

i=l 

S.t.: e X = 1, 

where Z is the matrix containing the given points Zi as columns. Once an optimal 
solution X* for the dual formulation is found, a primal solution can be recovered 
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as follows: 

1/2 



-ll^^lr+Eii^^ii'^n ' 4 = zx*. (21) 



Fig. 2: Performance of algorithms (RCD), (GM) and (CGD) for 50 full-iterations 
and initial point ei (top) and ^ (bottom) on a randomly generated matrix Z £ 

j^2xl000 




The direction dij at the current point in the algorithm (RCD) is computed in 
closed form. For computing the direction in the (CGD) method we need to solve 
a quadratic knapsack problem that has linear time complexity [12] . The direction 
at the current point for algorithm (GM) is computed using a linear time simplex 
projection algorithm introduced in [TT]. We compare algorithms (RCD), (CGD) 
and (GM) for a set of large-scale problem instances generated randomly with a 
uniform distribution. We recover a suboptimal radius and Chebyshev center using 
the same set of relations (|2H) evaluated at the final iteration point for all three 
algorithms. 

In Fig. 2 we present the performance of the three algorithms (RCD), (GM) 
and (CGD) on a randomly generated matrix Z G jjSxiooo fuii_iterations 
with two different initial points: = ei (the vector with the first entry 1 and the 
rest of the entries zeros) and = ^- Note that for the initial point = ei, the 
algorithm (GM) is outperformed by the other two methods: (RCD) and (CGD). 
Also, if all three algorithms are initialized with 2;° = ^, the algorithm (CGD) has 
the worst performance among all three. We observe that our algorithm (RCD) is 
very robust against the initial point choice. 

In Fig. 3 we plot the objective function evaluation over time (in seconds) for 
the three algorithms (RCD), (GM) and (CGD) on a matrix Z G rSOxiooo^ 
observe that the algorithm (RCD) has a comparable performance with algorithm 
(GM) and a much better performance than (CGD) when the initial point is taken 
^. On the other hand, the algorithm (GM) has the worst behavior among all 
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Fig. 3: Time performance of algorithms (RCD), (GM) and (CGD) for initial point 
^ (left) and ei (right) on a randomly generated matrix Z £ jjSOxiooo^ 





CPU Time (sec) 



three methods when sparse initializations are used. However, the behavior of our 
algorithm (RCD) is not dependent on the sparsity of the initial point. 

In Table 4, for a number of n = 5 ■ 10^, 10^ and 3 ■ 10* points generated ran- 
domly using uniform distribution in K^'' and R^'^, we compared all three algorithms 
(RCD), (CGD) and (GM) with two different initial points: a;° = ei and 2;° = f • 
Firstly, we computed /* with the algorithm (CGD) using 1° = e\ and imposed 
the termination criterion /(x-*^) — /(x'^"'"^) < e, where e = 10^^. Secondly, we used 
the precomputed optimal value /* to test the other algorithms with termination 
criterion j{x^) — /* < 1 or 2. We clearly see that our algorithm (RCD) has supe- 
rior performance over the (GM) method and is comparable with (CGD) method 
when we start from = ei . When we start from x^ = ^ our algorithm provides 
better performance in terms of objective function and CPU time (in seconds) than 
the (CGD) and (GM) methods (at least 6 times faster). We also observe that our 
algorithm is not sensitive w.r.t. the initial point. 



6.3 Random generated problems with ^i-regularization term 

In this section we compare algorithm (RCD) with the methods (CGD) and (GM) 
on problems with composite objective function, where nonsmooth part contains an 
^i-regularization term A^"^^^ \xi\. Many applications from signal processing and 
data mining can be formulated into the following optimization problem [5l l22] : 



1 / " 

inin-a; Z Zx + q x + \\y ^ \xi\ + 1[; „](x) 



(22) 



s.t.: a X = h, 



where Z G R"'^" and the penalty parameter A > 0. Further, the rest of the 
parameters are chosen as follows: a = e, b = \ and —l = u=\. The direction 
dij at the current point in the algorithm (RCD) is computed in closed form. For 
computing the direction in the (CGD) and (GM) methods we need to solve a 
double size quadratic knapsack problem of the form p8|l that has linear time 
complexity [12| . 
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Table 4: Comparison of algorithms (RCD), (CGD) and (GM) on Chebyshev center 
problems. 



^0 














lull 1LI_1/ UIJJJ 1111Hj^&I_L^ 








5 


■ 10-^ 
10 


2064/-79.80/0.76 


4620/-79.80/5.3 


17156/-79.82/5.6 




10 


6370/-S4. 71/4.75 


9604/-84.7/23.2 


42495/-S4. 71/28. 01 


n 


3 


. 10* 
10 


13213/-87. 12/31.15 


27287/-86. 09/206. 52 


55499/-S6. 09/111. 81 




5 


■ 10'^ 
30 


4269/-205.94/2.75 


823/-132.08/0.6 


19610/-204. 94/13. 94 




10* 
30 


56S4/-211.95/7.51 


9552/-211. 94/33. 42 


28102/-210. 94/40. 18 




3 


- 10* 
30 


23744/-2 15. 66/ 150.86 


156929/-2 14. 66/ 1729.1 


126272/-2 14. 66/937. 33 




5 


■ 10-^ 
10 


2392/-79.81/0.88 


611/-80.8/0.77 


29374/-79.8/9.6 




10* 

10 


9429/-84. 71/7.05 


350/-85.2/0.S6 


60777/-84. 7/40-1 




3 


. 10* 
10 


13007/-87. 1/30.64 


615/-88.09/6.20 


1 2922 1/-86. 09/258. 88 




5 


■ 10^ 
30 


2682/-205.94/1.73 


806/-206.94/1.13 


35777/-204. 94/25. 29 




10* 
30 


4382/-211.94/5.77 


594/-212.94/2.14 


59825/-210. 94/85. 52 




3 


. 10* 
30 


16601/-215. 67/102. 11 


707 /-216.66/8.02 


191303/-214. 66/1421 



In Table 5, for dimensions ranging from n = 10^ to n = lO'^ and for m = 10, we 
generated randomly the matrix Z G r'"X" and q £ R" using uniform distribution. 
We compared all three algorithms (RCD), (CGD) and (GM) with two different 
initial points a;" = ei and = and two different values of the penalty parameter 
A = 0.1 and A = 10. Firstly, we computed /* with the algorithm (CGD) using 

= §^ and imposed the termination criterion /(a::'^) — /(a;'^+^) < e, where e = 10~^. 
Secondly, we used the precomputed optimal value /* to test the other algorithms 
with termination criterion f{x^) — f* < 0.1 or 1. For the penalty parameter A = 10 
and initial point ei the algorithms (CGD) and (GM) have not returned any result 
within 5 hours. It can be clearly seen from Table 5 that for most of the tests with 
the initialization a::" = ei our algorithm (RCD) performs up to 100 times faster 
than the other two methods. Also, note that when we start from x'^ = ^ our 
algorithm provides a comparable performance, in terms of objective function and 
CPU time (in seconds), with algorithm (CGD). Finally, we observe that algorithm 
(RCD) is the most robust w.r.t. the initial point among all three tested methods. 
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